Introduction to Data Science with R

Session 7: Joins and strings

Ina Bornkessel-Schlesewsky

January 9, 2024

Roadmap for today

  • Joining data frames
  • Working with strings (text)
  • Put these skills to the test by using R to solve a murder mystery!

Joining data frames

Why join data frames?

  • Data of interest will often involve more than table
  • To conduct a meaningful analysis and address your questions of interest, you will need to join these
  • Interrelated tables of data are called relational data

The following brief overview is based on R4DS, Chapter 19.

Example: ncyflights13

  • The nycflights13 package contains data for flights that departed New York City airports in 2013
  • We will look at five tibbles from this package:
    • flights: on-time information for all flights
    • airports: information about airports (identifier = FAA airport code)
    • airlines: full name of airlines and their carrier code
    • planes: information about each plane
    • weather: information about the weather at each NYC airport (each hour)

Flights

head(flights, n=3)
# A tibble: 3 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Airports and planes

head(airports, n=3)
# A tibble: 3 × 8
  faa   name                            lat   lon   alt    tz dst   tzone       
  <chr> <chr>                         <dbl> <dbl> <dbl> <dbl> <chr> <chr>       
1 04G   Lansdowne Airport              41.1 -80.6  1044    -5 A     America/New…
2 06A   Moton Field Municipal Airport  32.5 -85.7   264    -6 A     America/Chi…
3 06C   Schaumburg Regional            42.0 -88.1   801    -6 A     America/Chi…
head(planes,n=3)
# A tibble: 3 × 9
  tailnum  year type               manufacturer model engines seats speed engine
  <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
2 N102UW   1998 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
3 N103US   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…

Airlines and weather

head(airlines, n=3)
# A tibble: 3 × 2
  carrier name                  
  <chr>   <chr>                 
1 9E      Endeavor Air Inc.     
2 AA      American Airlines Inc.
3 AS      Alaska Airlines Inc.  
head(weather,n=3)
# A tibble: 3 × 15
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
  <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
# ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>

Relationships between tables


from R4DS

Keys

  • The variables that connect the tables are called keys
  • Primary key: uniquely identifies an observation in its own table (e.g. tailnum uniquely identifies each individual plane in planes)
  • Foreign key: uniquely identifies observations in another table (tailnum in flights, because it links each flight to a particular plane)

Joining data frames: left join

We can add airline name information to the flights data using left_join(). A left join keeps all the rows of the “left” (first) data frame and joins matching rows from the right data frame.

flights2 <- flights |> 
  mutate(date = make_date(year = year, 
                          month = month, day = day)) |> 
  select(date,dep_time,carrier,flight,origin,dest)
  

flights2 |> 
  left_join(airlines, by = "carrier")
# A tibble: 336,776 × 7
   date       dep_time carrier flight origin dest  name                    
   <date>        <int> <chr>    <int> <chr>  <chr> <chr>                   
 1 2013-01-01      517 UA        1545 EWR    IAH   United Air Lines Inc.   
 2 2013-01-01      533 UA        1714 LGA    IAH   United Air Lines Inc.   
 3 2013-01-01      542 AA        1141 JFK    MIA   American Airlines Inc.  
 4 2013-01-01      544 B6         725 JFK    BQN   JetBlue Airways         
 5 2013-01-01      554 DL         461 LGA    ATL   Delta Air Lines Inc.    
 6 2013-01-01      554 UA        1696 EWR    ORD   United Air Lines Inc.   
 7 2013-01-01      555 B6         507 EWR    FLL   JetBlue Airways         
 8 2013-01-01      557 EV        5708 LGA    IAD   ExpressJet Airlines Inc.
 9 2013-01-01      557 B6          79 JFK    MCO   JetBlue Airways         
10 2013-01-01      558 AA         301 LGA    ORD   American Airlines Inc.  
# ℹ 336,766 more rows

Joining data frames: inner join

An inner join keeps only the matching rows in both data frames. Say we were only interested in flights by Hawaiian Airlines (HA) and Alaska Airlines (AS) and we had a modified airlines tibble with only these two airlines.

airlines2 <- airlines |> 
  filter(carrier %in% c("HA","AS"))

flights2 |> 
  inner_join(airlines2, by = "carrier")
# A tibble: 1,056 × 7
   date       dep_time carrier flight origin dest  name                  
   <date>        <int> <chr>    <int> <chr>  <chr> <chr>                 
 1 2013-01-01      724 AS          11 EWR    SEA   Alaska Airlines Inc.  
 2 2013-01-01      857 HA          51 JFK    HNL   Hawaiian Airlines Inc.
 3 2013-01-01     1808 AS           7 EWR    SEA   Alaska Airlines Inc.  
 4 2013-01-02      722 AS          11 EWR    SEA   Alaska Airlines Inc.  
 5 2013-01-02      909 HA          51 JFK    HNL   Hawaiian Airlines Inc.
 6 2013-01-02     1818 AS           7 EWR    SEA   Alaska Airlines Inc.  
 7 2013-01-03      724 AS          11 EWR    SEA   Alaska Airlines Inc.  
 8 2013-01-03      914 HA          51 JFK    HNL   Hawaiian Airlines Inc.
 9 2013-01-03     1817 AS           7 EWR    SEA   Alaska Airlines Inc.  
10 2013-01-04      725 AS          11 EWR    SEA   Alaska Airlines Inc.  
# ℹ 1,046 more rows

Schematics of different join types

Inner join


from R4DS

Keeps only the observations for which the keys match across tables.

Schematics of different join types

Outer joins

  • left join keeps all observations in x

  • right join keeps all observations in y

  • full join keeps all observations in x and y

  • NAs are added when needed

  • left_join() is the default joining operation, as it lets you add extra info to a table (like the carrier name to the flights data)

from R4DS

Schematics of different join types

Outer joins

from R4DS

Joining with different key names

What if we wanted to add full airport names to the weather data? In weather the key name is origin, while in airports, it is faa.


# A tibble: 3 × 15
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
  <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
# ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>
# A tibble: 3 × 8
  faa   name                            lat   lon   alt    tz dst   tzone       
  <chr> <chr>                         <dbl> <dbl> <dbl> <dbl> <chr> <chr>       
1 04G   Lansdowne Airport              41.1 -80.6  1044    -5 A     America/New…
2 06A   Moton Field Municipal Airport  32.5 -85.7   264    -6 A     America/Chi…
3 06C   Schaumburg Regional            42.0 -88.1   801    -6 A     America/Chi…

Joining with different key names

We can modify the by argument to do this.


weather |> 
  select(origin:humid) |> 
  left_join(airports, by = c("origin" = "faa"))
# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid name         lat   lon   alt
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl> <chr>      <dbl> <dbl> <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4 Newark Li…  40.7 -74.2    18
 2 EWR     2013     1     1     2  39.0  27.0  61.6 Newark Li…  40.7 -74.2    18
 3 EWR     2013     1     1     3  39.0  28.0  64.4 Newark Li…  40.7 -74.2    18
 4 EWR     2013     1     1     4  39.9  28.0  62.2 Newark Li…  40.7 -74.2    18
 5 EWR     2013     1     1     5  39.0  28.0  64.4 Newark Li…  40.7 -74.2    18
 6 EWR     2013     1     1     6  37.9  28.0  67.2 Newark Li…  40.7 -74.2    18
 7 EWR     2013     1     1     7  39.0  28.0  64.4 Newark Li…  40.7 -74.2    18
 8 EWR     2013     1     1     8  39.9  28.0  62.2 Newark Li…  40.7 -74.2    18
 9 EWR     2013     1     1     9  39.9  28.0  62.2 Newark Li…  40.7 -74.2    18
10 EWR     2013     1     1    10  41    28.0  59.6 Newark Li…  40.7 -74.2    18
# ℹ 26,105 more rows
# ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

Working with strings

What is a string?

  • Strings are ordered sequences of characters
  • We have already encountered them, e.g. when looking at how to create objects, when specifying criteria for filtering, or when setting titles and axis labels for our plots
  • For additional information about strings and how to create them, see R4DS, Ch.14
s <- "this is a string"

penguins |> 
  filter(species == "Gentoo") |> 
  head(n=3)
# A tibble: 3 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Gentoo  Biscoe           46.1          13.2               211        4500
2 Gentoo  Biscoe           50            16.3               230        5700
3 Gentoo  Biscoe           48.7          14.1               210        4450
# ℹ 2 more variables: sex <fct>, year <int>

Further string operations

  • We can do more with strings than just check to see if they are identical (as in the example on the previous slide)
  • The {stringr} package (part of the “tidyverse”) makes available a number of convenient functions1, e.g. to
    • split strings into substrings: str_sub()
    • detect patterns in strings: str_detect()
    • extract patterns from strings: str_extract()

A new data set

The following runs through a few additional tips and tricks that might help with your report. We will use a new data set on songs that were in the Spotify world charts in 2020 / 2021 (source: Kaggle; https://www.kaggle.com/sashankpillai/spotify-top-200-charts-20202021/version/2).1

Load and inspect the spotify data

spotify <-  read_csv("spotify_dataset.csv") |>  
  clean_names()

glimpse(spotify)
Rows: 1,556
Columns: 23
$ index                     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
$ highest_charting_position <dbl> 1, 2, 1, 3, 5, 1, 3, 2, 3, 8, 4, 9, 5, 1, 2,…
$ number_of_times_charted   <dbl> 8, 3, 11, 5, 1, 18, 16, 10, 8, 10, 43, 9, 3,…
$ week_of_highest_charting  <chr> "2021-07-23--2021-07-30", "2021-07-23--2021-…
$ song_name                 <chr> "Beggin'", "STAY (with Justin Bieber)", "goo…
$ streams                   <dbl> 48633449, 47248719, 40162559, 37799456, 3394…
$ artist                    <chr> "Måneskin", "The Kid LAROI", "Olivia Rodrigo…
$ artist_followers          <dbl> 3377762, 2230022, 6266514, 83293380, 5473565…
$ song_id                   <chr> "3Wrjm47oTz2sjIgck11l5e", "5HCyWlXZPP0y6Gqq8…
$ genre                     <chr> "['indie rock italiano', 'italian pop']", "[…
$ release_date              <chr> "2017-12-08", "2021-07-09", "2021-05-21", "2…
$ weeks_charted             <chr> "2021-07-23--2021-07-30\n2021-07-16--2021-07…
$ popularity                <dbl> 100, 99, 99, 98, 96, 97, 94, 95, 96, 95, 89,…
$ danceability              <dbl> 0.714, 0.591, 0.563, 0.808, 0.736, 0.610, 0.…
$ energy                    <dbl> 0.800, 0.764, 0.664, 0.897, 0.704, 0.508, 0.…
$ loudness                  <dbl> -4.808, -5.484, -5.044, -3.712, -7.409, -6.6…
$ speechiness               <dbl> 0.0504, 0.0483, 0.1540, 0.0348, 0.0615, 0.15…
$ acousticness              <dbl> 0.12700, 0.03830, 0.33500, 0.04690, 0.02030,…
$ liveness                  <dbl> 0.3590, 0.1030, 0.0849, 0.3640, 0.0501, 0.38…
$ tempo                     <dbl> 134.002, 169.928, 166.928, 126.026, 149.995,…
$ duration_ms               <dbl> 211560, 141806, 178147, 231041, 212000, 1378…
$ valence                   <dbl> 0.5890, 0.4780, 0.6880, 0.5910, 0.8940, 0.75…
$ chord                     <chr> "B", "C#/Db", "A", "B", "D#/Eb", "G#/Ab", "G…

Selecting substrings

Look at the most popular chords. str_sub() allows you to select a substring from a longer string (e.g. to extract the start of a word).

spotify |>  
  mutate(chord_simplified = str_sub(chord,1,1)) |> 
  count(chord_simplified, sort = TRUE)
# A tibble: 8 × 2
  chord_simplified     n
  <chr>            <int>
1 C                  369
2 G                  266
3 F                  252
4 A                  240
5 D                  165
6 B                  141
7 E                  112
8 <NA>                11

Detecting patterns

Look only at pop songs by using str_detect() to detect a pattern:

spotify  |>  
  filter(str_detect(genre, "pop"))  |>  
  select(song_name, artist, genre)
# A tibble: 819 × 3
   song_name                         artist         genre                       
   <chr>                             <chr>          <chr>                       
 1 Beggin'                           Måneskin       ['indie rock italiano', 'it…
 2 good 4 u                          Olivia Rodrigo ['pop']                     
 3 Bad Habits                        Ed Sheeran     ['pop', 'uk pop']           
 4 INDUSTRY BABY (feat. Jack Harlow) Lil Nas X      ['lgbtq+ hip hop', 'pop rap…
 5 MONTERO (Call Me By Your Name)    Lil Nas X      ['lgbtq+ hip hop', 'pop rap…
 6 Kiss Me More (feat. SZA)          Doja Cat       ['dance pop', 'pop']        
 7 Todo De Ti                        Rauw Alejandro ['puerto rican pop', 'trap …
 8 I WANNA BE YOUR SLAVE             Måneskin       ['indie rock italiano', 'it…
 9 Levitating (feat. DaBaby)         Dua Lipa       ['dance pop', 'pop', 'uk po…
10 Permission to Dance               BTS            ['k-pop', 'k-pop boy group']
# ℹ 809 more rows

Extract patterns

str_extract() extracts the text corresponding to a pattern. In this instance, extract occurrences of pop and write to a new column:

spotify  |>  
  select(song_name, artist, genre)  |>  
  mutate(genre_simple = str_extract(genre,"pop"))
# A tibble: 1,556 × 4
   song_name                         artist         genre           genre_simple
   <chr>                             <chr>          <chr>           <chr>       
 1 Beggin'                           Måneskin       ['indie rock i… pop         
 2 STAY (with Justin Bieber)         The Kid LAROI  ['australian h… <NA>        
 3 good 4 u                          Olivia Rodrigo ['pop']         pop         
 4 Bad Habits                        Ed Sheeran     ['pop', 'uk po… pop         
 5 INDUSTRY BABY (feat. Jack Harlow) Lil Nas X      ['lgbtq+ hip h… pop         
 6 MONTERO (Call Me By Your Name)    Lil Nas X      ['lgbtq+ hip h… pop         
 7 Kiss Me More (feat. SZA)          Doja Cat       ['dance pop', … pop         
 8 Todo De Ti                        Rauw Alejandro ['puerto rican… pop         
 9 Yonaguni                          Bad Bunny      ['latin', 'reg… <NA>        
10 I WANNA BE YOUR SLAVE             Måneskin       ['indie rock i… pop         
# ℹ 1,546 more rows

Extract data from strings

  • The {tidyr} package contains some useful functions for taking a string and separating it into multiple strings
  • This is useful when you have cells in a data frame that contain multiple entries (e.g. useful for tidying when these correspond to multiple variables)
  • separate_longer_delim() separates these into rows
  • separate_wider_delim() separates these into columns

Separate into rows

spotify |> 
  select(song_name, artist, genre) |> 
  separate_longer_delim(genre, delim=",")
# A tibble: 4,665 × 3
   song_name                         artist         genre                   
   <chr>                             <chr>          <chr>                   
 1 Beggin'                           Måneskin       "['indie rock italiano'"
 2 Beggin'                           Måneskin       " 'italian pop']"       
 3 STAY (with Justin Bieber)         The Kid LAROI  "['australian hip hop']"
 4 good 4 u                          Olivia Rodrigo "['pop']"               
 5 Bad Habits                        Ed Sheeran     "['pop'"                
 6 Bad Habits                        Ed Sheeran     " 'uk pop']"            
 7 INDUSTRY BABY (feat. Jack Harlow) Lil Nas X      "['lgbtq+ hip hop'"     
 8 INDUSTRY BABY (feat. Jack Harlow) Lil Nas X      " 'pop rap']"           
 9 MONTERO (Call Me By Your Name)    Lil Nas X      "['lgbtq+ hip hop'"     
10 MONTERO (Call Me By Your Name)    Lil Nas X      " 'pop rap']"           
# ℹ 4,655 more rows

To remove extraneous characters, check out str_remove(). We won’t be covering separate_wider_delim() here as it is a bit more difficult to use without a fixed number of strings to extract. Just be aware that it exists.

Patterns?

The pattern detection / extraction capabilities of these functions have much more to offer, via so-called regular expressions. See R4DS, Ch.15 for further details – many more options than can be covered here. Let’s return to the {nycflights13} data to look at just a couple of examples.

Detecting patterns in strings

Let’s say we wanted to find all airports containing the word “New”. We can use str_detect() to do this.

airports |> 
  filter(str_detect(name,"New"))
# A tibble: 19 × 8
   faa   name                               lat    lon   alt    tz dst   tzone  
   <chr> <chr>                            <dbl>  <dbl> <dbl> <dbl> <chr> <chr>  
 1 EHM   Cape Newenham Lrrs                58.6 -162.    541    -9 A     Americ…
 2 ESN   Easton-Newnam Field Airport       38.8  -76.1    72    -5 A     Americ…
 3 EWB   New Bedford Regional Airport      41.7  -71.0    80    -5 A     Americ…
 4 EWK   Newton City-County Airport        38.1  -97.3  1533    -6 A     Americ…
 5 EWR   Newark Liberty Intl               40.7  -74.2    18    -5 A     Americ…
 6 GON   Groton New London                 41.3  -72.0     9    -5 A     Americ…
 7 HVN   Tweed-New Haven Airport           41.3  -72.9    14    -5 A     Americ…
 8 ILG   New Castle                        39.7  -75.6    79    -5 A     Americ…
 9 JCI   New Century AirCenter Airport     38.8  -94.9  1087    -6 A     Americ…
10 KNW   New Stuyahok Airport              59.4 -157.    302    -9 A     Americ…
11 MSY   Louis Armstrong New Orleans Intl  30.0  -90.3     4    -6 A     Americ…
12 NBG   New Orleans Nas Jrb               29.8  -90.0     3    -6 A     Americ…
13 ONP   Newport Municipal Airport         44.6 -124.    160    -8 A     Americ…
14 PHF   Newport News Williamsburg Intl    37.1  -76.5    43    -5 A     Americ…
15 UUU   Newport State                     41.5  -71.3   172    -5 A     Americ…
16 WWT   Newtok Airport                    60.9 -165.     25    -9 U     Americ…
17 ZRP   Newark Penn Station               40.7  -74.2     0    -5 A     Americ…
18 ZRZ   New Carrollton Rail Station       38.9  -76.9    39    -5 A     Americ…
19 ZVE   New Haven Rail Station            41.3  -72.9     7    -5 A     Americ…

Detecting patterns in strings

  • str_detect() is case sensitive:
airports  |>  
  filter(str_detect(name,"new"))
# A tibble: 1 × 8
  faa   name               lat   lon   alt    tz dst   tzone            
  <chr> <chr>            <dbl> <dbl> <dbl> <dbl> <chr> <chr>            
1 GNU   Goodnews Airport  59.1 -162.    15    -9 A     America/Anchorage
  • but we can specify alternatives in square brackets:
airports |>  
  filter(str_detect(name,"[Nn]ew"))
# A tibble: 20 × 8
   faa   name                               lat    lon   alt    tz dst   tzone  
   <chr> <chr>                            <dbl>  <dbl> <dbl> <dbl> <chr> <chr>  
 1 EHM   Cape Newenham Lrrs                58.6 -162.    541    -9 A     Americ…
 2 ESN   Easton-Newnam Field Airport       38.8  -76.1    72    -5 A     Americ…
 3 EWB   New Bedford Regional Airport      41.7  -71.0    80    -5 A     Americ…
 4 EWK   Newton City-County Airport        38.1  -97.3  1533    -6 A     Americ…
 5 EWR   Newark Liberty Intl               40.7  -74.2    18    -5 A     Americ…
 6 GNU   Goodnews Airport                  59.1 -162.     15    -9 A     Americ…
 7 GON   Groton New London                 41.3  -72.0     9    -5 A     Americ…
 8 HVN   Tweed-New Haven Airport           41.3  -72.9    14    -5 A     Americ…
 9 ILG   New Castle                        39.7  -75.6    79    -5 A     Americ…
10 JCI   New Century AirCenter Airport     38.8  -94.9  1087    -6 A     Americ…
11 KNW   New Stuyahok Airport              59.4 -157.    302    -9 A     Americ…
12 MSY   Louis Armstrong New Orleans Intl  30.0  -90.3     4    -6 A     Americ…
13 NBG   New Orleans Nas Jrb               29.8  -90.0     3    -6 A     Americ…
14 ONP   Newport Municipal Airport         44.6 -124.    160    -8 A     Americ…
15 PHF   Newport News Williamsburg Intl    37.1  -76.5    43    -5 A     Americ…
16 UUU   Newport State                     41.5  -71.3   172    -5 A     Americ…
17 WWT   Newtok Airport                    60.9 -165.     25    -9 U     Americ…
18 ZRP   Newark Penn Station               40.7  -74.2     0    -5 A     Americ…
19 ZRZ   New Carrollton Rail Station       38.9  -76.9    39    -5 A     Americ…
20 ZVE   New Haven Rail Station            41.3  -72.9     7    -5 A     Americ…

Detecting patterns in strings

  • We can further specify our patterns, e.g. by using ^ to specify the beginning of a string
airports |>  
  filter(str_detect(name,"^New"))
# A tibble: 14 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 EWB   New Bedford Regional Airport    41.7  -71.0    80    -5 A     America/…
 2 EWK   Newton City-County Airport      38.1  -97.3  1533    -6 A     America/…
 3 EWR   Newark Liberty Intl             40.7  -74.2    18    -5 A     America/…
 4 ILG   New Castle                      39.7  -75.6    79    -5 A     America/…
 5 JCI   New Century AirCenter Airport   38.8  -94.9  1087    -6 A     America/…
 6 KNW   New Stuyahok Airport            59.4 -157.    302    -9 A     America/…
 7 NBG   New Orleans Nas Jrb             29.8  -90.0     3    -6 A     America/…
 8 ONP   Newport Municipal Airport       44.6 -124.    160    -8 A     America/…
 9 PHF   Newport News Williamsburg Intl  37.1  -76.5    43    -5 A     America/…
10 UUU   Newport State                   41.5  -71.3   172    -5 A     America/…
11 WWT   Newtok Airport                  60.9 -165.     25    -9 U     America/…
12 ZRP   Newark Penn Station             40.7  -74.2     0    -5 A     America/…
13 ZRZ   New Carrollton Rail Station     38.9  -76.9    39    -5 A     America/…
14 ZVE   New Haven Rail Station          41.3  -72.9     7    -5 A     America/…

Put your knowledge to the test!

Solve a murder mystery in R

  • We will be using the {reclues} package in R to get access to the materials we need
  • This is an R adaptation of an exercise first developed by the Knight lab at Northwestern University as an exercise to help with learning SQL (a database query language). Here is a link to their walkthrough

  • Start by installing the {reclues} package using:
    • devtools::install_github("sciencificity/reclues")
  • You may need to install the {devtools} package first

Instructions

A crime has taken place and the detective needs your help. The detective gave you the crime scene report, but you somehow lost it. You vaguely remember that the crime was a murder that occurred sometime on Jan.15, 2018 and that it took place in SQL City. All the clues to this mystery are buried in a huge database, and you need to use your R skills to navigate through this vast network of information. Your first step to solving the mystery is to retrieve the corresponding crime scene report from the police department’s database.

from https://sciencificity.github.io/reclues/, adapted from http://mystery.knightlab.com/walkthrough.html

The database

from http://mystery.knightlab.com/walkthrough.html

The database

This is a relational database, comprising a number of interconnected tables (dataframes in R). The golden key shows when a column is a primary identifier for the entries in that dataframe (also known as a “primary key”), while the blue arrows signal that the column can be used to link the data to entries in another dataframe. (Think joins! 😉) Note that the ID variables may have different names depending on the dataframe!

In R …

from https://sciencificity.github.io/reclues/

These dataframes are available to you once you have installed and loaded the reclues package.

Getting started

  • Start by looking at the crime_scene_report dataframe to locate the crime scene report for the crime in question
  • Use your knowledge regarding the crime – it was a murder that occurred in SQL City on Jan 15th 2018 – and your R skills to isolate the appropriate row in the dataframe
  • Then read the description and follow the clues by using the other dataframes available to you
head(crime_scene_report)
# A tibble: 6 × 4
      date type    description                                             city 
     <int> <chr>   <chr>                                                   <chr>
1 20180115 robbery A Man Dressed as Spider-Man Is on a Robbery Spree       NYC  
2 20180115 murder  Life? Dont talk to me about life.                       Alba…
3 20180115 murder  Mama, I killed a man, put a gun against his head...     Reno 
4 20180215 murder  REDACTED REDACTED REDACTED                              SQL …
5 20180215 murder  Someone killed the guard! He took an arrow to the knee! SQL …
6 20180115 theft   Big Bully stole my lunch money!                         Chic…

A challenge


  • See whether you can solve the crime without manually looking through any of the tables – use your data manipulation skills instead!
  • Try to use at least one joining and one string manipulation operation
  • Hint: once you have narrowed it down to the obvious suspect, have a think about whether this is indeed the final solution